[improve] [broker] Not close the socket if lookup failed caused by bundle unloading or metadata ex #21211

poorbarcode · 2023-09-20T16:33:08Z

Motivation

Background: The Pulsar client will close the socket if it receives a ServiceNotReady error when doing a lookup.
see

pulsar/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ClientCnx.java

Lines 1193 to 1200 in af20a8a

    
    private void checkServerError(ServerError error, String errMsg) {
        if (ServerError.ServiceNotReady.equals(error)) {
            log.error("{} Close connection because received internal-server error {}", ctx.channel(), errMsg);
            ctx.close();
        } else if (ServerError.TooManyRequests.equals(error)) {
            incrementRejectsAndMaybeClose();
        }
    }

Closing the socket forces the other consumers and producers on that connection to reconnect, and it does not make the lookup any more efficient.
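To make the cost concrete, here is a minimal sketch of one connection shared by several producers and consumers. It is a toy model, not Pulsar's `ClientCnx`: the class `SharedConnectionSketch`, its methods, and the reconnect counter are all hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of one socket shared by several producers/consumers (hypothetical, not ClientCnx). */
final class SharedConnectionSketch {
    /** Only the two error codes discussed in this PR. */
    enum ServerError { ServiceNotReady, MetadataError }

    private final List<String> attachedClients = new ArrayList<>();
    private boolean open = true;
    private int forcedReconnects = 0;

    /** Register a producer or consumer that uses this connection. */
    void attach(String clientName) {
        attachedClients.add(clientName);
    }

    /** Mirrors the behavior quoted above: only ServiceNotReady closes the socket. */
    void onLookupError(ServerError error) {
        if (error == ServerError.ServiceNotReady && open) {
            open = false;
            // Every attached producer/consumer now has to reconnect.
            forcedReconnects += attachedClients.size();
        }
        // Any other error fails just the lookup; the socket stays open.
    }

    boolean isOpen() {
        return open;
    }

    int getForcedReconnects() {
        return forcedReconnects;
    }
}
```

With three attached clients, a `MetadataError` leaves the connection open, while a single `ServiceNotReady` forces all three to reconnect.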

There are two cases that should be improved:

If the broker gets a metadata read/write error, the broker responds with a ServiceNotReady error, but it should respond with a MetadataError.
If the topic is unloading, the broker responds with a ServiceNotReady error.

Modifications

Respond to the client with a MetadataError if the broker gets a metadata read/write error.
Respond to the client with a MetadataError if the topic is unloading.
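The two modifications amount to mapping the lookup failure to an error code. The sketch below illustrates that mapping in isolation. It assumes the failure arrives wrapped in a `CompletionException`, and it uses stand-in types: `LookupErrorMappingSketch`, the nested `MetadataStoreException`, and `toServerError` are hypothetical names for illustration, not the broker's real classes.

```java
import java.util.concurrent.CompletionException;

/** Illustrative mapping from a lookup failure to an error code (stand-in types only). */
final class LookupErrorMappingSketch {

    /** Only the two error codes this PR is concerned with. */
    enum ServerError { MetadataError, ServiceNotReady }

    /** Hypothetical stand-in for the broker's metadata store exception. */
    static class MetadataStoreException extends Exception {
        MetadataStoreException(String message) {
            super(message);
        }
    }

    /** Strip CompletionException wrappers added by async pipelines (an assumption of this sketch). */
    static Throwable unwrap(Throwable t) {
        while (t instanceof CompletionException && t.getCause() != null) {
            t = t.getCause();
        }
        return t;
    }

    static ServerError toServerError(Throwable failure) {
        Throwable unwrapped = unwrap(failure);
        if (unwrapped instanceof MetadataStoreException) {
            // Metadata read/write error: the client should not close the socket.
            return ServerError.MetadataError;
        }
        if (unwrapped instanceof IllegalStateException) {
            // In this PR's context, thrown while the namespace bundle is being unloaded.
            return ServerError.MetadataError;
        }
        // Anything else keeps the previous behavior.
        return ServerError.ServiceNotReady;
    }
}
```

Falling back to `ServiceNotReady` for every other exception keeps unexpected failures on the old code path.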

Documentation

doc
doc-required
doc-not-needed
doc-complete

Matching PR in forked repository

PR in forked repository: x

heesung-sn · 2023-09-20T19:28:33Z

pulsar-broker/src/main/java/org/apache/pulsar/broker/lookup/TopicLookupBase.java

+            lookupFuture.complete(newLookupErrorResponse(ServerError.MetadataError, errorMsg, requestId));
+        } else {
+            log.warn("Failed to lookup {} for topic {} with error {}", clientAppId, topicName, errorMsg);
+            lookupFuture.complete(newLookupErrorResponse(ServerError.ServiceNotReady, errorMsg, requestId));


why do we need to return ServiceNotReady?

why not UnknownError?

Just to guarantee that uncontrollable errors continue the previous behavior.

heesung-sn · 2023-09-20T19:30:12Z

pulsar-broker/src/main/java/org/apache/pulsar/broker/lookup/TopicLookupBase.java

+        if (unwrapEx instanceof IllegalStateException) {
+            // Current broker still hold the bundle's lock, but the bundle is being unloading.
+            log.info("Failed to lookup {} for topic {} with error {}", clientAppId, topicName, errorMsg);
+            lookupFuture.complete(newLookupErrorResponse(ServerError.MetadataError, errorMsg, requestId));


how do we know IllegalStateException is always MetadataError?

In the current case, the IllegalStateException is only thrown when the namespace bundle is unloading. See https://github.com/apache/pulsar/blob/master/pulsar-broker/src/main/java/org/apache/pulsar/broker/namespace/NamespaceService.java#L453-L455C36

    } else if (nsData.get().isDisabled()) {
        future.completeExceptionally(
                new IllegalStateException(String.format("Namespace bundle %s is being unloaded", bundle)));
    }

I agree with you; we should define this exception clearly. Because so many places rely on the method NamespaceService.findBrokerServiceUrl, such as PulsarWebResource.validateTopicOwnershipAsync, we need a separate PR to focus on it.

315157973 · 2023-09-24T03:39:15Z

pulsar-broker/src/main/java/org/apache/pulsar/broker/lookup/TopicLookupBase.java

+        if (unwrapEx instanceof IllegalStateException) {
+            // Current broker still hold the bundle's lock, but the bundle is being unloading.
+            log.info("Failed to lookup {} for topic {} with error {}", clientAppId, topicName, errorMsg);
+            lookupFuture.complete(newLookupErrorResponse(ServerError.MetadataError, errorMsg, requestId));


There may be a side effect here.
Because the connection can be reset, the previous producer could fail fast when the bundle was unloaded and move to a new broker.
This now causes each partition's producer to have to wait for a timeout.

@315157973

Because the connection can be reset, the previous producer could fail fast when the bundle was unloaded and move to a new broker.

The previous producer will eventually receive a CommandCloseProducer and try to reconnect, even if the topic is closed without waiting for the client to disconnect, right?

This now causes each partition's producer to have to wait for a timeout.

The partition's producer will try to reconnect according to the backoff rules, so it will not end in a timeout.
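For readers unfamiliar with the reconnect backoff mentioned here, the following is a minimal sketch of a capped exponential backoff. It is not Pulsar's own backoff implementation: the class name `ReconnectBackoffSketch`, the doubling factor, and the absence of jitter are assumptions made for illustration.

```java
import java.time.Duration;

/** Capped exponential backoff for reconnect attempts (illustrative, not Pulsar's class). */
final class ReconnectBackoffSketch {
    private final long initialMs;
    private final long maxMs;
    private long nextMs;

    ReconnectBackoffSketch(Duration initial, Duration max) {
        this.initialMs = initial.toMillis();
        this.maxMs = max.toMillis();
        this.nextMs = this.initialMs;
    }

    /** Returns the delay before the next attempt, then doubles it up to the cap. */
    long nextDelayMs() {
        long current = nextMs;
        nextMs = Math.min(nextMs * 2, maxMs);
        return current;
    }

    /** Call after a successful reconnect so the next failure starts small again. */
    void reset() {
        nextMs = initialMs;
    }
}
```

Each failed attempt schedules a retry after a bounded delay, so a producer keeps retrying instead of sitting idle until an operation timeout fires.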

I also improved the test to ensure the producer and consumer keep working. See testLookupConnectionNotCloseIfGetUnloadingExOrMetadataEx.

Maybe I misunderstood what you meant, could you explain the details?

poorbarcode · 2023-09-24T16:03:31Z

Rebase master

lhotari · 2023-10-01T12:11:49Z

Flaky test #21292 detected by the flaky test script. @poorbarcode do you have a chance to check that?

poorbarcode · 2023-10-06T06:56:13Z

Flaky test #21292 detected by the flaky test script. @poorbarcode do you have a chance to check that?

Sure

…ception (#390)

### Motivation

When the broker fails to acquire ownership of a namespace bundle because of a `LockBusyException`, it means that another broker has acquired the metadata store path and has not released it. For example:

Broker 1:

```
2024-01-24T23:35:36,626+0000 [metadata-store-10-1] WARN org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup <role> for topic persistent://<tenant>/<ns>/<topic> with error org.apache.pulsar.broker.PulsarServerException: Failed to acquire ownership for namespace bundle <tenant>/<ns>/0x50000000_0x51000000
Caused by: java.util.concurrent.CompletionException: org.apache.pulsar.metadata.api.MetadataStoreException$LockBusyException: Resource at /namespace/<tenant>/<ns>/0x50000000_0x51000000 is already locked
```

Broker 2:

```
2024-01-24T23:35:36,650+0000 [broker-topic-workers-OrderedExecutor-3-0] INFO org.apache.pulsar.broker.PulsarService - Loaded 1 topics on <tenant>/<ns>/0x50000000_0x51000000 -- time taken: 0.044 seconds
```

After broker 2 released the lock at 23:35:36,650, the lookup request to broker 1 should tell the client that namespace bundle 0x50000000_0x51000000 is currently being unloaded, and on the next retry the client will connect to the new owner broker.

Here is another typical error:

```
2024-01-24T23:57:57,264+0000 [pulsar-io-4-5] INFO org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup <role> for topic persistent://<tenant>/<ns>/<topic> with error Namespace bundle <tenant>/<ns>/0x0d000000_0x0e000000 is being unloaded
```

After apache/pulsar#21211, the server error for this case becomes `MetadataError` rather than `ServiceNotReady`. However, whenever the `ServerError` is still `ServiceNotReady`, the client closes the connection. If many other producers or consumers share the same connection, they all re-establish their connections to the broker, which is unnecessary and puts a lot of pressure on the broker side.

### Modifications

In `checkServerError`, when the error code is `ServiceNotReady`, check the error message as well. If it matches one of the cases in `handleLookupError`, do not close the connection.

Add `ConnectionTest` on a mocked `ClientConnection` object to verify that `close()` is not called.
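The message check described in these modifications can be sketched as follows. This is an illustration in Java, not the C++ client's actual code: the class `ServiceNotReadyFilterSketch`, the method `shouldCloseConnection`, and the choice of marker substrings (taken from the log lines quoted in this description) are assumptions.

```java
/** Decide whether a lookup error should close the shared connection (illustrative only). */
final class ServiceNotReadyFilterSketch {

    /** Only the two error codes discussed here. */
    enum ServerError { ServiceNotReady, MetadataError }

    /** Substrings copied from the broker log messages quoted above; assumed to mark retriable cases. */
    private static final String[] RETRIABLE_MARKERS = {
        "Failed to acquire ownership for namespace bundle",
        "is being unloaded"
    };

    static boolean shouldCloseConnection(ServerError error, String message) {
        if (error != ServerError.ServiceNotReady) {
            // This sketch only models the ServiceNotReady branch.
            return false;
        }
        if (message != null) {
            for (String marker : RETRIABLE_MARKERS) {
                if (message.contains(marker)) {
                    // Bundle ownership is in flux: retry the lookup, keep the socket.
                    return false;
                }
            }
        }
        return true;
    }
}
```

Matching on message text is brittle if the broker changes its wording, which is one reason the broker-side change to return `MetadataError` is the cleaner signal.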

github-actions bot added the doc-not-needed (Your PR changes do not impact docs) label Sep 20, 2023

poorbarcode requested review from hangc0276, 315157973 and codelipenghui September 20, 2023 16:34

poorbarcode self-assigned this Sep 20, 2023

poorbarcode added the release/3.0.2, release/2.11.3, release/2.10.6, and category/reliability (The function does not work properly in certain specific environments or failures. e.g. data lost) labels Sep 20, 2023

poorbarcode added this to the 3.2.0 milestone Sep 20, 2023

merlimat requested a review from heesung-sn September 20, 2023 18:03

heesung-sn reviewed Sep 20, 2023

heesung-sn approved these changes Sep 21, 2023

315157973 reviewed Sep 24, 2023

[improve] [broker] Not close the socket if lookup failed caused by bundle unloading or metadata ex

f133a9a

poorbarcode force-pushed the improve/lookup branch from b3f9e6e to f133a9a Compare September 24, 2023 16:03

poorbarcode requested a review from 315157973 September 27, 2023 07:38

fix test

dd37534

Technoboy- added the ready-to-test label Sep 27, 2023

Technoboy- approved these changes Sep 27, 2023

poorbarcode merged commit 09a1720 into apache:master Sep 27, 2023
46 checks passed

poorbarcode added the cherry-picked/branch-3.0 label Oct 7, 2023

poorbarcode added the cherry-picked/branch-2.11 label Oct 8, 2023

poorbarcode added the cherry-picked/branch-2.10 label Oct 8, 2023

BewareMyPower mentioned this pull request Jan 30, 2024

Do not close the socket when the broker failed due to MetadataStoreException apache/pulsar-client-cpp#390

Merged

Technoboy- added cherry-picked/branch-3.1 release/3.1.3 labels Feb 27, 2024
